Data wrangling and types of variable names
Metadata
Project design
Summary statistics
Graphing the mean and standard error
Pipes (%>% or |>) and how to use group_by()
Our last graph
We are going to use some real sculpin data!
# A tibble: 1 × 4
   mean    sd    se count
  <dbl> <dbl> <dbl> <int>
1  51.7  12.0 0.834   208
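A summary table like this one can be produced with a short dplyr pipeline. The sketch below is only an illustration: `sculpin_df` and its `length_mm` column are hypothetical names, and the values are simulated stand-ins rather than the real sculpin measurements, so the numbers will not match the output above.

```r
library(dplyr)

# Synthetic stand-in data (hypothetical; NOT the real sculpin dataset)
set.seed(1)
sculpin_df <- tibble(length_mm = rnorm(208, mean = 52, sd = 12))

sculpin_summary <- sculpin_df |>
  summarise(
    mean  = mean(length_mm),   # sample mean
    sd    = sd(length_mm),     # sample standard deviation
    se    = sd / sqrt(n()),    # standard error of the mean = sd / sqrt(n)
    count = n()                # number of observations
  )

sculpin_summary
```

The standard error is computed from the `sd` column created one line earlier, which is why the order of arguments inside `summarise()` matters here.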
What is a frequency distribution?
# A tibble: 28 × 2
   length_bin     n
   <fct>      <int>
 1 [11,13]        4
 2 (19,21]        1
 3 (23,25]        1
 4 (27,29]        2
 5 (29,31]        2
 6 (31,33]        1
 7 (33,35]        4
 8 (35,37]        3
 9 (37,39]        7
10 (39,41]        9
# ℹ 18 more rows
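One way to build a frequency table like this is to cut the lengths into 2 mm bins and count the observations in each bin. This is a sketch under assumptions: `lengths_df` and `length_mm` are hypothetical names, the values are made up for illustration, and the original slides may have used a different binning function.

```r
library(dplyr)

# Hypothetical lengths in mm, for illustration only
lengths_df <- tibble(
  length_mm = c(11, 12, 12.5, 13, 20, 24, 34, 34.5, 38, 40, 40.5)
)

freq_table <- lengths_df |>
  mutate(
    # 2 mm wide bins starting at 11; include.lowest closes the first bin: [11,13]
    length_bin = cut(length_mm, breaks = seq(11, 41, by = 2),
                     include.lowest = TRUE)
  ) |>
  count(length_bin)   # bins with no observations are dropped by default

freq_table
```

Because `count()` drops empty bins by default, gaps in the bin sequence (such as no row between the first and second bins above) simply mean that no fish fell in those ranges.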
What happens as sample size changes…
Can we make assumptions about the distribution of a random variable (e.g., weight) in the population?
Probability distribution:
Now we can look at the probability of many different ranges of lengths:
Probability that the length is larger than the mean
Probability that the length is larger than 70 mm
Probability that the length falls between two numbers
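If we are willing to assume the lengths follow a normal distribution, these probabilities can be sketched with `pnorm()`. The mean and standard deviation are taken from the summary table above; the normality assumption and the 40 and 60 mm bounds are choices made here for illustration only.

```r
# Assumption: lengths ~ Normal(mean = 51.7, sd = 12.0), using the sample summary
mu    <- 51.7
sigma <- 12.0

# P(length > mean): half of a symmetric distribution lies above its mean
p_above_mean <- pnorm(mu, mean = mu, sd = sigma, lower.tail = FALSE)

# P(length > 70 mm): upper tail beyond 70
p_above_70 <- pnorm(70, mean = mu, sd = sigma, lower.tail = FALSE)

# P(40 < length < 60): difference of two cumulative probabilities
# (40 and 60 are arbitrary example bounds)
p_40_to_60 <- pnorm(60, mean = mu, sd = sigma) - pnorm(40, mean = mu, sd = sigma)

c(p_above_mean, p_above_70, p_40_to_60)
```

Under these assumptions the first probability is exactly 0.5 and the second is roughly 0.06, since 70 mm sits about 1.5 standard deviations above the mean.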
Normal (Gaussian): symmetrical, bell-shaped
Defined in terms of mean and variance (μ, σ²)
The standard normal distribution (SND, z-distribution) has mean μ = 0 and variance σ² = 1
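For reference, the standard textbook form of the normal density and the transformation to the standard normal can be written as follows. The symbol y for an observation is a notational choice made here.

$$
f(y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
z = \frac{y-\mu}{\sigma}
$$

Subtracting μ and dividing by σ turns any normal variable into one with mean 0 and variance 1, which is why a single z-distribution can serve for all normal variables.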
Poisson: defined in terms of the mean (μ) only
Right-skewed at small μ
More symmetrical at higher μ
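The change in shape can be checked numerically. The sketch below assumes the one-parameter distribution described here is the Poisson; `pois_skewness()` is a hypothetical helper written for this illustration, and the values μ = 1 and μ = 20 are arbitrary.

```r
# Assumption: the distribution described above is the Poisson.
# Hypothetical helper: skewness computed directly from the probability mass function.
pois_skewness <- function(mu, max_count = 200) {
  k <- 0:max_count                  # counts to sum over (tail beyond 200 is negligible here)
  p <- dpois(k, lambda = mu)        # Poisson probabilities
  m <- sum(k * p)                   # mean
  v <- sum((k - m)^2 * p)           # variance
  sum((k - m)^3 * p) / v^1.5        # third standardized moment
}

skew_small <- pois_skewness(1)    # small mean: strongly right-skewed
skew_large <- pois_skewness(20)   # larger mean: much closer to symmetric

c(skew_small, skew_large)
```

For a Poisson distribution the skewness equals 1 / sqrt(μ), so it shrinks toward zero as the mean grows.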
Random sampling is crucial for inference:
sample -> population
statistics -> parameters
Two main kinds of summary statistics: center and spread
Center:
- Mean (µ, ȳ): sum of sampled values divided by n
- Mode: the most common value in the dataset
- Median: middle measurement of the ordered data; equals the mean for normal distributions
Mean
Median
Formula for n odd
Formula for n even
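The formulas themselves are not shown here, so the standard textbook versions are sketched below. The notation is an assumption: y_i are the n sampled values and y_(k) is the k-th smallest value after sorting.

$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
$$

$$
\text{median} =
\begin{cases}
y_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\[6pt]
\dfrac{y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)}}{2} & n \text{ even}
\end{cases}
$$

With an odd number of observations the median is the single middle value; with an even number it is the average of the two middle values.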
E.g., fish lengths = 20, 30, 35, 24, 36 mm
# A tibble: 1 × 1
   mean
  <dbl>
1    29
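The same result can be checked by hand in base R. This is a minimal sketch using the five example values; `fish_lengths` is a name chosen here.

```r
# The five example fish lengths from above
fish_lengths <- c(20, 30, 35, 24, 36)

fish_mean   <- mean(fish_lengths)     # (20 + 30 + 35 + 24 + 36) / 5
fish_median <- median(fish_lengths)   # middle value of the sorted data: 20, 24, 30, 35, 36

c(mean = fish_mean, median = fish_median)
```

Here n = 5 is odd, so the median is simply the third value of the sorted data.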
Spread
Sum of squares: (20 - 29)^2 + (30 - 29)^2 + (35 - 29)^2 + (24 - 29)^2 + (36 - 29)^2 = 192
Variance: 192 / (5 - 1) = 48 mm^2
Problem: weird units!
# A tibble: 1 × 2
   mean variance
  <dbl>    <dbl>
1    29       48
Standard deviation: the square root of the variance, in the same units as the observations
In the example: √48 ≈ 6.9 mm
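The arithmetic for the spread can be reproduced step by step in base R. This is a sketch using the five example values; the variable names are chosen here for illustration.

```r
# The five example fish lengths
fish_lengths <- c(20, 30, 35, 24, 36)

# Sum of squared deviations from the mean
sum_sq <- sum((fish_lengths - mean(fish_lengths))^2)

# Sample variance: divide by n - 1, not n
variance <- sum_sq / (length(fish_lengths) - 1)

# Standard deviation: back in the original units
std_dev <- sqrt(variance)

c(sum_sq = sum_sq, variance = variance, sd = std_dev)
```

The built-in `var()` and `sd()` functions use the same n - 1 denominator, so `var(fish_lengths)` and `sd(fish_lengths)` should agree with the hand calculation.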
Problem: we don't know the true values of the population parameters
Goal: estimate the parameters from empirical data (samples)
3 general methods of parameter estimation:
- Maximum Likelihood Estimation (MLE)
- Ordinary Least Squares (OLS)
- Resampling techniques
MLE: a general method that estimates parameters by maximizing the likelihood of the observed data given the parameter values.
It aims to find the parameter values that make the observed data most probable under the assumed statistical model.
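As a rough illustration of the idea, the sketch below fits a normal model to the five example fish lengths by numerically minimizing the negative log-likelihood. The choice of a normal model, the starting values, and the helper `neg_log_lik()` are assumptions made here, not part of the original material.

```r
# The five example fish lengths
fish_lengths <- c(20, 30, 35, 24, 36)

# Hypothetical helper: negative log-likelihood of a normal model.
# The sd is estimated on the log scale so it always stays positive.
neg_log_lik <- function(par, y) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Minimizing the negative log-likelihood is the same as maximizing the likelihood
fit <- optim(par = c(25, log(5)), fn = neg_log_lik, y = fish_lengths,
             method = "BFGS")

mle_mean <- fit$par[1]
mle_sd   <- exp(fit$par[2])

c(mean = mle_mean, sd = mle_sd)
```

For a normal model the maximum likelihood estimate of the mean matches the sample mean, while the maximum likelihood estimate of the standard deviation divides by n rather than n - 1, so it comes out somewhat smaller than the value from `sd()`.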
OLS: a specific method for estimating the parameters of a linear regression model.
It minimizes the sum of the squared differences between observed and predicted values.
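A small sketch of OLS in R follows. The length and mass values are hypothetical numbers invented for illustration, and the closed-form slope and intercept are shown alongside `lm()` only to make the least-squares calculation visible.

```r
# Hypothetical data, for illustration only: fish length (mm) and mass (g)
length_mm <- c(20, 24, 30, 35, 36)
mass_g    <- c(1.1, 1.9, 3.2, 4.8, 5.1)

# lm() fits the line by ordinary least squares
fit <- lm(mass_g ~ length_mm)

# The same estimates from the closed-form least-squares solution
slope     <- cov(length_mm, mass_g) / var(length_mm)
intercept <- mean(mass_g) - slope * mean(length_mm)

# The quantity OLS minimizes: the residual sum of squares
rss <- sum(residuals(fit)^2)

coef(fit)
c(intercept = intercept, slope = slope, rss = rss)
```

Any other choice of intercept and slope would give a larger residual sum of squares for these data than the fitted line does.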